STATS 32: Introduction to R for Undergraduates

Elena Tuzhilina

Oct 14, 2021

http://web.stanford.edu/~elenatuz/courses/stats32-aut2021/

Recap of session 7

File paths and working directories

Factors

Functions for factors

All these functions are part of the forcats package, which is automatically loaded when you load the tidyverse package.

Agenda for today

RStudio

When you open RStudio, you should see something like this:


There should be 3 different windows along with a number of tabs.

R console on the left: runs commands interactively. Type a command and hit the Enter key.

Environment in Top-right: list of objects that we have access to.

Files in Bottom-right: allows you to navigate the directory structure on your computer.

Plots in Bottom-right: any graphical output you make will be displayed here.

Help in Bottom-right: documentation for functionName appears here when you type ?functionName in the console.

Top-left: nothing so far, but potentially your R script or R markdown.

You can store your code in: R scripts

To create an R script: click the New File button in the top-left corner of the window, then click “R Script”.

Types of information in R scripts

To execute the code: highlight the code and click Run at the top of the window (or press Cmd-Enter on a Mac, Ctrl-Enter on Windows).

The output will appear in the Console or Plots pane.

You can store your code in: R markdown

To create an R Markdown file: click the New File button in the top-left corner of the window, then click “R Markdown”.

Types of information in R markdown

Code chunks in R markdown

To insert a new code chunk: press Option-Cmd-I on a Mac or Ctrl-Alt-I on Windows.

To execute the code: click on the green arrow in the top-right corner of the code chunk.

The output will appear below the code chunk.

Common Rmd chunk options

Text chunks in R markdown

Add your description outside the code chunk.

Can add:

Markdown reference here.

Rmd workflow (basic)

  1. Edit .Rmd file in RStudio.
  2. Knit the document (either by hitting the button or using a keyboard shortcut).
    • When you press “Knit”, the file is automatically saved.
    • Next, RStudio opens a new console, “knits” the document there, then closes that console. No code is run in your original console!
    • RStudio creates a .html file in the same folder as the .Rmd file.
  3. Preview output in the preview pane, or by opening the .html file.
    • If you want to make changes, go back to Step 1.

Exercise 1

What is a variable?

x <- 3
y <- "abc"
y <- 5
x <- y
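
The assignments above can be traced step by step. The sketch below repeats them with comments and adds one extra line (y <- 10, which is not on the slide) to illustrate that x keeps the value it was given and does not follow later changes to y.

```r
x <- 3        # x holds the number 3
y <- "abc"    # y holds the string "abc"
y <- 5        # y is overwritten; "abc" is no longer stored anywhere
x <- y        # x receives the current value of y, which is 5

y <- 10       # extra step (not on the slide): changing y afterwards
x             # ... leaves x untouched
## [1] 5
y
## [1] 10
```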

Basic objects in R: Vectors

vec <- c("a", "b", "c")
vec
## [1] "a" "b" "c"

vec <- 1:10
vec
##  [1]  1  2  3  4  5  6  7  8  9 10
vec*2-1
##  [1]  1  3  5  7  9 11 13 15 17 19
vec^2
##  [1]   1   4   9  16  25  36  49  64  81 100

To extract a subset of elements by their indices, put a vector of indices in square brackets.

vecsq <- vec^2
vecsq
##  [1]   1   4   9  16  25  36  49  64  81 100

Contiguous range (elements 3 through 7)

vecsq[3:7] 
## [1]  9 16 25 36 49

Just some elements (elements 3 and 5)

vecsq[c(3,5)]
## [1]  9 25

All elements except 3 and 5

vecsq[-c(3, 5)]
## [1]   1   4  16  36  49  64  81 100

Basic objects in R: Matrices

Two-dimensional analogs of vectors

A <- matrix(1:12, nrow = 3, ncol = 4)
A
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

To extract a subset of elements, put the row numbers, then the column numbers, in square brackets.

One element (1st row, 2nd column)

A[1, 2]
## [1] 4

One row (3rd row)

A[3, ]
## [1]  3  6  9 12

A block (rows 1-3 and columns 1-3)

A[1:3, 1:3]
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
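
The same row-then-column pattern also covers whole columns and non-adjacent selections. A short sketch on the matrix A defined above (dim() is added here only to confirm the shape):

```r
A <- matrix(1:12, nrow = 3, ncol = 4)

A[, 2]             # one column (2nd column): leave the row position empty
## [1] 4 5 6

A[2:3, c(1, 4)]    # rows 2-3, columns 1 and 4
##      [,1] [,2]
## [1,]    2   11
## [2,]    3   12

dim(A)             # number of rows, number of columns
## [1] 3 4
```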

Basic objects in R: Lists

cars <- list(make = "Honda", 
             models = c("Fit", "CR-V", "Odyssey"), 
             available = c(TRUE, TRUE, TRUE))

To extract an element of a list, use [[ ]] or $ notation with the element's name.

cars$make         
## [1] "Honda"
cars[["models"]]
## [1] "Fit"     "CR-V"    "Odyssey"
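
A point that often causes confusion is the difference between single and double brackets. The sketch below reuses the cars list from above; the class() calls are added only to show what each form returns.

```r
cars <- list(make = "Honda",
             models = c("Fit", "CR-V", "Odyssey"),
             available = c(TRUE, TRUE, TRUE))

class(cars["models"])      # single brackets return a smaller list
## [1] "list"
class(cars[["models"]])    # double brackets return the element itself
## [1] "character"

cars$models[2]             # extract the element, then index into it
## [1] "CR-V"
names(cars)                # the names that can be used with $ and [[
## [1] "make"      "models"    "available"
```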

Basic objects in R: Data Frames

Data structure for storing datasets.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

A data frame has a list structure.

mtcars$mpg
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4

And a matrix structure.

mtcars[1,2]
## [1] 6
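
Because a data frame behaves like both a list and a matrix, the two styles of indexing can be mixed. A small sketch on the built-in mtcars data:

```r
mtcars[["mpg"]][1:3]     # list style: pull out a column, then index it
## [1] 21.0 21.0 22.8

mtcars[1, "cyl"]         # matrix style: row number, column name
## [1] 6

mtcars[1:2, "mpg"]       # matrix style: several rows of one column
## [1] 21 21
```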

Useful commands:
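
As an illustration, here are a few base R commands that are commonly used to inspect a data frame. Which commands the original slide listed is not known, so this selection is an assumption.

```r
dim(mtcars)            # number of rows and columns
## [1] 32 11
names(mtcars)          # column names
head(mtcars, 3)        # first 3 rows
str(mtcars)            # type and first few values of each column
summary(mtcars$mpg)    # five-number summary and mean of one column
```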

Exercise 2

Answer these questions about mtcars:

Add the answer to your R markdown (code + comments).

ggplot essential elements of graphics: data, geometries, aesthetics

Geometries: Visual elements used for our data

Here we use geom_point().

Aesthetics: define which data columns control which visual aspects of the geom. The available aesthetics depend on the geometry you choose.

Here we use three aesthetics:

(shape, size, etc. take on default values, not determined by data)

Examples of other aesthetics

ggplot2 code

ggplot()

ggplot2 code

ggplot() +
    geom_histogram(data = df, mapping = aes(x = mpg))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot2 code

ggplot() +
    geom_boxplot(data = df, mapping = aes(x = cylinders, y = mpg))

ggplot2 code

ggplot() +
    geom_point(data = df, 
               mapping = aes(x = weight, y = mpg, col = cylinders),
               shape = 15)

Note that shape is outside the aesthetics: although it controls a visual property of the plot, it has nothing to do with the data. Aesthetics link your data to the visual properties of the plot.
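
To make the contrast concrete, the sketch below draws the same scatterplot twice: once with shape fixed outside aes(), once with shape mapped to a column inside aes(). The slides do not show how df was built, so the preprocessing here (renaming wt to weight and turning cyl and gear into factors) is an assumption.

```r
library(ggplot2)

# Assumed preprocessing: the real construction of df is not shown in the slides.
df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl),
                 gear = factor(mtcars$gear))

# shape is a fixed setting: every point is drawn as a filled square
p_fixed <- ggplot() +
    geom_point(data = df,
               mapping = aes(x = weight, y = mpg, col = cylinders),
               shape = 15)

# shape is an aesthetic: the point shape now depends on the cylinders column
p_mapped <- ggplot() +
    geom_point(data = df,
               mapping = aes(x = weight, y = mpg, col = cylinders,
                             shape = cylinders))

p_fixed
p_mapped
```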

Two types of variables in data frames

Why do we care? It helps you to choose the plot type.

Barplots: counts for a categorical variable

What is the distribution of cylinders in my dataset?

ggplot() +
    geom_bar(data = df, mapping = aes(x = cylinders)) +
    ggtitle("Count by cylinders") +
    xlab("No. of cylinders")

Note that y is automatically set to counts.

Histograms: counts for a continuous variable

What is the distribution of miles per gallon in my dataset?

ggplot() + 
    geom_histogram(data = df, mapping = aes(x = mpg)) +
    ggtitle("Histogram of miles per gallon")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
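
The message above is a reminder that 30 bins is only a default. A minimal sketch of choosing the bin width explicitly, using the original mtcars data and an arbitrary width of 2 miles per gallon:

```r
library(ggplot2)

# binwidth = 2 is an arbitrary choice for illustration; try other values.
p_hist <- ggplot() +
    geom_histogram(data = mtcars, mapping = aes(x = mpg), binwidth = 2) +
    ggtitle("Histogram of miles per gallon")
p_hist
```

With binwidth set, the stat_bin() message no longer appears.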

Scatterplots: continuous variable vs. continuous variable

What is the relationship between mpg and weight?

ggplot() + 
    geom_point(data = df, mapping = aes(y = mpg, x = weight), size = 2) + 
    ggtitle("Miles per gallon vs. weight")

Lineplots: continuous variable vs. time variable

What is the relationship between mpg and time?

ggplot() +
    geom_line(data = vehicles, mapping = aes(y = `mean highway mpg`, x = year)) +
    ggtitle("Mean highway mpg by year")

Easier to see the trend.

Boxplots & violin plots: continuous variable vs. categorical variable

For each value of cylinders, what does the distribution of mpg look like?

ggplot() + 
    geom_boxplot(data = df, mapping = aes(x = cylinders, y = mpg)) +
    ggtitle("Distribution of mpg by cylinders")

ggplot() + 
    geom_violin(data = df, mapping = aes(x = cylinders, y = mpg)) +
    ggtitle("Distribution of mpg by cylinders")

Heatmaps: categorical variable vs. categorical variable

How often does each pair of cylinder and gear occur in the dataset?

ggplot() + 
    geom_tile(data = df, mapping = aes(y = gear, x = cylinders, fill = count)) + 
    ggtitle("Distribution of (cylinder, gear)")
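
The heatmap needs one row per (cylinders, gear) pair together with its count. How the slides computed this is not shown; the sketch below is one possible way using base R's table(), with column names chosen to match the plotting code above.

```r
library(ggplot2)

# Count how often each (cylinders, gear) pair occurs in mtcars.
# The names cylinders, gear and count are assumptions chosen to match the plot.
counts <- as.data.frame(table(cylinders = mtcars$cyl, gear = mtcars$gear),
                        responseName = "count")

p_tile <- ggplot() +
    geom_tile(data = counts,
              mapping = aes(y = gear, x = cylinders, fill = count)) +
    ggtitle("Distribution of (cylinder, gear)")
p_tile
```

Pairs that never occur in the data still get a row, with a count of 0.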

Layers: Combining multiple plots into one graphic

We can have more than one layer in a graphic.


Each layer contains (essentially):

ggplot2 code

ggplot() +
    geom_boxplot(data = df, mapping = aes(x = cylinders, y = mpg)) +
    geom_point(data = df, mapping = aes(x = cylinders, y = mpg), 
               position = "jitter")

ggplot2 code

When layers share attributes, we only have to type them once:

ggplot(data = df, mapping = aes(x = cylinders, y = mpg)) +
    geom_boxplot() +
    geom_point(position = "jitter")

The argument names data and mapping can also be omitted:

ggplot(df, aes(x = cylinders, y = mpg)) +
    geom_boxplot() +
    geom_point(position = "jitter")

Facets

ggplot(data = df, mapping = aes(y = mpg, x = weight)) + 
  geom_point(aes(col = cylinders), size = 2) + 
  facet_wrap(~cylinders, ncol = 1)

Facets

ggplot(data = df, mapping = aes(y = mpg, x = weight)) + 
  geom_point(aes(col = cylinders), size = 2) + 
  facet_grid(gear~cylinders, labeller = label_both)

Other useful options

Exercise 3

In the mtcars data:

Add the answer to your R markdown (code + comments). Note that the code above uses a preprocessed version of the data; check ?mtcars for a description of the original data.

Transforming data with dplyr

See lectures 5-6 for details.

Exercise 4

For the mtcars data:

Add the answer to your R markdown (code + comments).